ABSTRACT

In TREC 2009, we extend our Voting Model for the faceted blog
distillation, top stories identification, and related entity finding tasks.
Moreover, we experiment with our novel xQuAD framework for
search result diversification. Besides fostering our research in multiple
directions, by participating in such a wide portfolio of tracks,
we further develop the indexing and retrieval capabilities of our
Terrier Information Retrieval platform, to effectively and efficiently
cope with a new generation of large-scale test collections.

1.  INTRODUCTION 

In TREC 2009, we participate in the Blog, Entity, Million Query,
Relevance Feedback and Web tracks. This year, we have further
developed our Terrier IR platform [29] with regard to efficiency
and effectiveness for the newly introduced large-scale collections.
Participation in such a wide portfolio of tracks allows us to comprehensively
evaluate Terrier in a challenging environment. Our primary
research directions focus on further applications for the Voting
Model [20], as well as on experimenting with our novel xQuAD
framework for search result diversification [34, 35].

In the faceted blog distillation task of the Blog track, we investigate
how machine learning techniques can be used to address
faceted blog ranking. In particular, on top of a Voting Model-based
blog retrieval system, we devise a large set of features, and investigate
the effectiveness of formulating the faceted blog distillation
problem as a text classification or a learning-to-rank problem.

For the Blog track top news stories identification task, we identify
the most important headlines for each day, by using the Voting
Model. In particular, we believe that the number of blog posts mentioning
a headline (aka votes) is a good indicator of the importance
of each headline. However, as the blogosphere exhibits a bursty nature,
we examine how to make use of the fact that important headlines
can persist over a period of days. Lastly, we identify a set
of novel yet relevant blog posts for each headline, by diversifying
these blog posts based on temporal distance or content similarity.

In the Entity track, we extend the Voting Model to the task of
finding related entities, by considering the co-occurrence of query
terms and candidate entities in a document as a vote for the strength
of the relationship between these entities and the query entity. In
addition, we experiment with novel graph-based techniques, in order
to promote entities associated with authoritative documents or
documents from the same community as the query entity.


For the diversity task of the Web track, we experiment with our
novel xQuAD diversification framework, based on the explicit account
of the possible aspects underlying a query, in the form of
sub-queries [34, 35]. In particular, we investigate the effectiveness
of exploiting query suggestions provided by a major Web search
engine as sub-queries within our proposed framework.

Lastly, in our participation in the Web track adhoc task, the Relevance
Feedback track and the Million Query track, we test the effectiveness
and efficiency of Terrier on the large-scale ClueWeb09
corpus. In particular, we test and further enhance our MapReduce-based
indexing implementation in Terrier [27, 28], and deploy distributed
retrieval techniques [6, 32] to permit efficient experimentation
on this new, large-scale corpus.

The remainder of this paper is structured as follows. Section 2
describes the corpora used in our participation, along with the associated
indexing and retrieval strategies we employ. Section 3 defines
the models we use for retrieval and relevance feedback, and
also introduces the Voting Model. Sections 4 and 5 cover our participation
in the Blog track faceted blog distillation and top stories
tasks, respectively. Our participation in the Entity track is discussed
in Section 6. Sections 7 and 8 discuss our work in the adhoc and
diversity tasks of the Web track, respectively. Sections 9 and 10
present our hypotheses and results for the Million Query and Relevance
Feedback tracks, respectively. Lastly, Section 11 provides
concluding remarks and directions for future research.

2.  INDEXING  &  RETRIEVAL 

The  test  collection  for  the  Blog  track  is  the  new  TREC  Blogs08 
collection,  which  is  a  crawl  of  the  blogosphere  over  a  54-week 
period  [25].  During  this  time,  the  blog  posts  (permalinks),  feeds 
(RSS/Atom  XML)  and  homepages  of  each  blog  were  collected.  In 
our  participation  in  the  Blog  track,  we  index  only  the  permalinks 
component  of  the  collection.  In  particular,  there  are  approximately 
28  million  documents  in  this  component. 

For the Entity, Million Query, Relevance Feedback, and Web
tracks, the test collection is the new billion document TREC ClueWeb09
collection, which has an uncompressed size of 25TB.¹ We
index this collection in two ways: firstly, the so-called ‘category B’
subset, containing 50 million English documents; and secondly,
all 500 million English documents (the ‘category A’ subset).

¹ http://boston.lti.cs.cmu.edu/Data/clueweb09
Stage    Input                     Output
Map      Document                  (Term, PostingList)
Reduce   Term, list[PostingList]   Inverted index

Table 1: Overview of the MapReduce functions used during
indexing.

For indexing purposes, we treat the above two collections in the
same way. Using the Terrier IR platform² [29], we create content-based
indices, including the document body and title. Each term
is stemmed using Porter’s English stemmer, and standard English
stopwords are removed. In both cases, we use our distributed MapReduce
indexing implementation in Terrier [27, 28]. The indexing
process is split into many ‘map’ tasks over the input data, followed
by one or more ‘reduce’ tasks to create the final inverted index
shards. In particular, Table 1 gives an overview of the map and reduce
functions used in our implementation. Each map task takes
as input a document to be indexed, and processes that document,
building up a miniature inverted index in memory. When memory
is exhausted, the mini-inverted index is emitted from the map task
to disk, in the form of (Term, PostingList) tuples. Each reduce task
takes as input several posting lists for a given term, and merges
these into the final inverted index. Note that the number of reduce
tasks defines the number of inverted index shards created. For more
details and comparative experiments, see [27].
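The map and reduce functions of Table 1 can be illustrated with a minimal single-process Python sketch. This is not Terrier's implementation: the function names, the `max_postings_in_memory` threshold (a stand-in for "memory is exhausted"), and the (doc_id, tf) posting format are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_documents(docs, max_postings_in_memory=1000):
    """Map task: build a mini inverted index in memory and emit
    (term, posting_list) tuples whenever the buffer is full."""
    mini_index = defaultdict(list)
    buffered = 0
    for doc_id, tokens in docs:
        counts = defaultdict(int)
        for tok in tokens:
            counts[tok] += 1
        for term, tf in counts.items():
            mini_index[term].append((doc_id, tf))
            buffered += 1
        if buffered >= max_postings_in_memory:
            # Flush the mini-inverted index (hypothetical memory limit reached).
            yield from sorted(mini_index.items())
            mini_index = defaultdict(list)
            buffered = 0
    if mini_index:
        yield from sorted(mini_index.items())


def reduce_term(term, posting_lists):
    """Reduce task: merge several posting lists of one term into one."""
    merged = sorted(p for plist in posting_lists for p in plist)
    return term, merged


def build_index(docs, **map_kwargs):
    """Drive the map and reduce steps; the dict stands in for the shuffle."""
    grouped = defaultdict(list)
    for term, plist in map_documents(docs, **map_kwargs):
        grouped[term].append(plist)
    return dict(reduce_term(t, plists) for t, plists in grouped.items())
```

In a real MapReduce job, the grouping by term is performed by the framework's shuffle phase, and each reducer would write one index shard rather than return a dictionary.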

We use a distributed version of Terrier to speed up retrieval for
large corpora. In particular, we use distributed retrieval for retrieving
documents from the ClueWeb09 category A corpus (500 million
documents). Following [6, 32], our system uses one query
server to serve results from one or more document-partitioned index
shards, while a centralised query broker is responsible for passing
the query to each query server, and merging the results. Moreover,
the process for each query follows two phases. Firstly, the
query is tokenised, and each term is passed to the query servers
to obtain their local statistics for the term. These local statistics
are merged by the broker, so that accurate global statistics are obtained.
In the second phase, the query servers score and rank their
documents, making use of the global statistics. Finally, the documents
from each query server are merged into a single ranking by
the broker. During merging, no score normalisation is necessary,
as the retrieval approach applied by each query server is identical,
using exactly the same global statistics.
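The two-phase broker protocol described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions: the `QueryServer` class and `broker_search` function are hypothetical names, the shards are in-memory dictionaries, and a simple tf-idf-style weight stands in for the actual weighting model.

```python
import heapq
import math


class QueryServer:
    """Hypothetical query server holding one document-partitioned shard."""

    def __init__(self, docs):
        self.docs = docs  # {doc_id: list of tokens}

    def local_stats(self, terms):
        # Phase 1: report local document count and document frequencies.
        df = {t: sum(1 for toks in self.docs.values() if t in toks) for t in terms}
        return len(self.docs), df

    def score(self, terms, global_n, global_df, k):
        # Phase 2: score local documents using the merged global statistics.
        results = []
        for doc_id, toks in self.docs.items():
            s = sum(toks.count(t) * math.log2(1 + global_n / global_df[t])
                    for t in terms if global_df[t] > 0)
            if s > 0:
                results.append((s, doc_id))
        return heapq.nlargest(k, results)


def broker_search(servers, query, k=10):
    terms = query.lower().split()
    stats = [s.local_stats(terms) for s in servers]
    global_n = sum(n for n, _ in stats)
    global_df = {t: sum(df[t] for _, df in stats) for t in terms}
    partial = [s.score(terms, global_n, global_df, k) for s in servers]
    # Scores are directly comparable, so merging needs no normalisation.
    merged = heapq.nlargest(k, (r for part in partial for r in part))
    return [(doc_id, score) for score, doc_id in merged]
```

Because every server scores with the same global statistics, the broker can merge the partial rankings by raw score, which mirrors the observation that no score normalisation is required.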

3.  MODELS 

The main weighting model used in our TREC 2009 participation
is the DPH model, which is derived from the Divergence From Randomness
(DFR) framework [1]. Using DPH, the relevance score of
a document d for a query Q is given by [2]:

$$ score(d,Q) = \sum_{t \in Q} qtw \cdot \frac{(1-F)^2}{tf+1} \cdot \Big( tf \cdot \log_2\big(tf \cdot \frac{avg\_l}{l} \cdot \frac{N}{TF}\big) + 0.5 \cdot \log_2\big(2\pi \cdot tf \cdot (1-F)\big) \Big) \qquad (1) $$

where F is given by tf/l, tf is the within-document frequency, and
l is the document length in tokens. avg_l is the average document
length in the collection, N is the number of documents in the collection,
and TF is the term frequency in the collection. Note that
DPH is a parameter-free model, and therefore requires no particular
tuning. qtw is the query term weight and is given by qtf/qtf_max,
where qtf is the query term frequency and qtf_max is the maximum
query term frequency among all query terms.
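To make the roles of the DPH statistics concrete, the following Python sketch computes the score of Equation (1) in its standard published form, using the symbols defined above. The function and parameter names are assumptions for illustration and do not correspond to Terrier's API; returning zero for a term that does not occur (or that fills the whole document, where 1 - F = 0) is also an assumption of this sketch.

```python
import math
from collections import Counter


def dph_term_score(tf, doc_len, avg_doc_len, n_docs, coll_tf, qtw=1.0):
    """DPH contribution of one query term to one document."""
    if tf <= 0 or tf >= doc_len:
        return 0.0  # assumption: no score when the logarithms are undefined
    f = tf / doc_len                     # F = tf / l
    norm = (1 - f) ** 2 / (tf + 1)
    info = tf * math.log2((tf * avg_doc_len / doc_len) * (n_docs / coll_tf))
    info += 0.5 * math.log2(2 * math.pi * tf * (1 - f))
    return qtw * norm * info


def dph_score(query_terms, doc_tfs, doc_len, avg_doc_len, n_docs, coll_tfs):
    """Sum the per-term scores, with qtw = qtf / qtf_max."""
    qtf = Counter(query_terms)
    if not qtf:
        return 0.0
    qtf_max = max(qtf.values())
    return sum(
        dph_term_score(doc_tfs.get(t, 0), doc_len, avg_doc_len, n_docs,
                       coll_tfs[t], qtw=f / qtf_max)
        for t, f in qtf.items() if coll_tfs.get(t, 0) > 0
    )
```

Note that no free parameter appears anywhere in the computation, which is what makes the model usable without tuning.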

² http://www.terrier.org


3.1  Terms Dependence in the Divergence From
Randomness Framework

Taking into account the dependence and proximity of query terms
in documents can increase adhoc retrieval effectiveness. To this
end, we use an extension of the DFR framework that can account
for the dependence of query terms in documents [19, 30]. In general,
when using a term dependence model, the score of a document
d for a query Q is given as follows:

$$ score(d,Q) = \sum_{t \in Q} qtw \cdot score(d,t) + \sum_{p \in Q_2} score(d,p) \qquad (2) $$

where score(d,t) is the score assigned to a query term t in the
document d, p corresponds to a pair of query terms, and Q_2 is
the set that contains all possible combinations of two query terms.
In Equation (2), the sum of qtw · score(d,t) over all query terms
can be estimated by any DFR weighting model, such as DPH.
The score(d,p) of a pair of query terms in a document is computed
as follows:

$$ score(d,p) = -\log_2(P_{p1}) \cdot (1 - P_{p2}) \qquad (3) $$

where P_p1 is the probability that there is a document in which a pair
of query terms p occurs a given number of times. P_p1 can be computed
with any randomness model from the DFR framework, such
as the Poisson approximation to the Binomial distribution. P_p2 corresponds
to the probability of seeing the query term pair once more,
after having seen it a given number of times. P_p2 can be computed
using any of the after-effect models in the DFR framework. The
difference between score(d,p) and score(d,t) is that the former
depends on occurrences of the pair of query terms p, while the latter
depends on occurrences of the query term t.

This year, for efficiency reasons, we applied the pBiL
randomness model [19], which does not consider the collection frequency
of pairs of query terms. It is based on the binomial randomness
model, and computes the score of a pair of query terms in a
document as follows:

$$ score(d,p) = \frac{1}{pf+1} \cdot \Big( -\log_2 (l-1)! + \log_2 pf! + \log_2 (l-1-pf)! - pf \cdot \log_2(p_p) - (l-1-pf) \cdot \log_2(p'_p) \Big) \qquad (4) $$

where l is the size of document d in tokens, p_p = 1/(l-1), p'_p = 1 - p_p,
and pf is the frequency of the tuple p, i.e., the number of windows
of size ws in document d in which the tuple p occurs.
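As a rough illustration of the pBiL computation, the sketch below counts the windows in which a pair of terms co-occurs and then evaluates the binomial pair score, taking p_p = 1/(l-1) and a 1/(pf+1) normalisation as in the published form of the model [19]. The function names, the unordered sliding-window counting, and the choice to return zero when the pair never occurs are assumptions of this sketch rather than details given in the text. Factorials are evaluated through `math.lgamma` to avoid overflow on long documents.

```python
import math


def count_pair_windows(tokens, t1, t2, ws):
    """Number of sliding windows of size ws containing both t1 and t2
    (unordered co-occurrence; an assumed definition of pf)."""
    n_windows = max(1, len(tokens) - ws + 1)
    return sum(
        1 for i in range(n_windows)
        if t1 in tokens[i:i + ws] and t2 in tokens[i:i + ws]
    )


def log2_factorial(n):
    """log2(n!) computed via the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)


def pbil_pair_score(pf, doc_len):
    """Binomial pair score for a pair seen in pf windows of a document."""
    n = doc_len - 1
    if pf <= 0 or n < 2:
        return 0.0  # assumption: pairs that never occur contribute nothing
    pf = min(pf, n)
    p = 1.0 / n          # p_p
    q = 1.0 - p          # p'_p
    info = (-log2_factorial(n) + log2_factorial(pf) + log2_factorial(n - pf)
            - pf * math.log2(p) - (n - pf) * math.log2(q))
    return info / (pf + 1)
```

The bracketed quantity is the negative log-probability of observing pf successes in l-1 binomial trials, so the score rewards pair frequencies that are unlikely under randomness, damped by the 1/(pf+1) factor.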

3.2  Relevance  Feedback 

We use a term weighting model in the context of the Relevance
Feedback (RF) track, and also for pseudo-relevance feedback (PRF)
and collection enrichment (CE) [18, 21, 31] in the Blog track. The
central idea behind PRF is to assume that the top documents returned
for a query are relevant, while in RF, a few relevant documents
are known. We can then learn from these feedback documents
to improve retrieval performance through query expansion
or term re-weighting. In particular, we apply the Bo1 term weighting
model, derived from the DFR framework [1]. This model is
based upon Bose-Einstein statistics and works in a similar fashion
to Rocchio’s relevance feedback method [33]. In Bo1, the informativeness
w(t) of a term is given by:

$$ w(t) = tf_x \cdot \log_2 \frac{1 + P_n}{P_n} + \log_2(1 + P_n) \qquad (5) $$

where tf_x is the frequency of the term t in the pseudo-relevant
set, P_n is given by TF/N, TF is the frequency of t in the whole
collection, and N is the number of documents in the collection.
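The Bo1 weight of Equation (5) is simple enough to sketch directly. In the Python illustration below, `bo1_weight` follows the standard Bo1 formula with P_n = TF/N, while `expand_query`, its arguments, and the tie-breaking by term are hypothetical helpers introduced only to show how the top expansion terms might be selected from a set of feedback documents.

```python
import math
from collections import Counter


def bo1_weight(tfx, coll_tf, n_docs):
    """Bo1 informativeness of a term seen tfx times in the feedback set."""
    p_n = coll_tf / n_docs
    return tfx * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)


def expand_query(feedback_docs, coll_tfs, n_docs, n_terms=10):
    """Rank the terms of the feedback documents by Bo1 and keep the top ones."""
    tfx = Counter(tok for doc in feedback_docs for tok in doc)
    weights = {
        t: bo1_weight(f, coll_tfs[t], n_docs)
        for t, f in tfx.items() if coll_tfs.get(t, 0) > 0
    }
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n_terms]
```

Terms that are frequent in the feedback set but rare in the collection receive the highest weights, whereas very common collection terms are penalised even when they occur often in the feedback documents.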

3.3  Voting  Model 

The Voting Model [20] addresses the task of ranking document
aggregates instead of individual documents. In TREC 2009, we
consider different types of aggregates for specific tasks. In the
faceted blog distillation task of the Blog track, blogs are represented
by aggregates of blog posts, whereas in the top stories identification
task, aggregates of blog posts are used to represent the
days in which these blog posts are published. Lastly, in the Entity
track, entities are represented by aggregates of the documents in
which they occur.

In all cases, we consider the ranking of documents with respect
to the query Q, which we denote R(Q). The intuition is that a
document aggregate ranking with respect to Q can be modelled as
a voting process, using the retrieved documents in R(Q). Specifically,
every document in R(Q) is possibly associated with one or
more aggregates, and these associations act as votes for each aggregate
to be relevant to Q. The votes for each aggregate are then
appropriately combined to form the final ranking, taking into account
the number of associated voting documents, as well as their
relevance scores. Importantly, this model is extensible and general,
and is not collection or topic dependent. It should be noted that,
in practice, R(Q) contains only a finite number of top documents,
with the size of R(Q) denoted |R(Q)|.

In [24], we defined twelve voting techniques for aggregating
votes for candidate experts within the context of the expert search
task, inspired by data fusion techniques and social choice theory.
In this work, we use two voting techniques, namely Votes and
expCombMNZ. In Votes, the score of an aggregate C with respect to a
query Q is given by:

$$ score_{Votes}(C,Q) = |R(Q) \cap profile(C)| \qquad (6) $$

where |R(Q) ∩ profile(C)| is the number of documents from the
profile of the aggregate C that are in the ranking R(Q).

The robust and effective expCombMNZ voting technique ranks
aggregates by considering the sum of the exponential of the relevance
scores of the documents associated with each aggregate.
Moreover, it includes a component which takes into account the
number of documents in R(Q) associated with each aggregate, hence
explicitly modelling the number of votes made by the documents
for each aggregate. In expCombMNZ, aggregates are scored as:

$$ score_{expCombMNZ}(C,Q) = |R(Q) \cap profile(C)| \cdot \sum_{d \in R(Q) \cap profile(C)} \exp(score(d,Q)) \qquad (7) $$

where score(d, Q) is the score of document d for query Q, as given
by a standard weighting model, such as DPH (Equation (1)).
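Both voting techniques can be sketched compactly in Python. The data layout below is an assumption for illustration: `ranking` is a list of (document id, score) pairs standing for R(Q), and `profiles` maps each aggregate (a blog, a day, or an entity) to the set of documents in its profile. Omitting aggregates that receive no votes is also a choice of this sketch.

```python
import math


def votes(ranking, profiles):
    """Votes: number of retrieved documents in each aggregate's profile."""
    retrieved = {doc_id for doc_id, _ in ranking}
    counts = {agg: len(retrieved & docs) for agg, docs in profiles.items()}
    return {agg: c for agg, c in counts.items() if c > 0}


def exp_comb_mnz(ranking, profiles):
    """expCombMNZ: vote count times the sum of exp(score) of the voters."""
    scores = dict(ranking)
    result = {}
    for agg, docs in profiles.items():
        voters = [d for d in docs if d in scores]
        if voters:
            result[agg] = len(voters) * sum(math.exp(scores[d]) for d in voters)
    return result
```

For example, with R(Q) = [("p1", 2.0), ("p2", 1.0), ("p3", 0.5)], an aggregate whose profile contains p1 and p2 receives two votes under Votes and a score of 2 · (e^2 + e^1) under expCombMNZ.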

4.  BLOG TRACK: FACETED BLOG DISTILLATION TASK

In the faceted blog distillation task, the goal is to produce a ranking
of blogs for a given query that have a recurrent interest in the
topic of the query, and that also fulfil a required facet. In particular,
three facets are considered in this task: indepth, opinionated,
and personal [25]. For each query, participants are required to provide
a baseline ranking, and two rankings fulfilling the possible
inclinations of the facet associated with the query. For instance, for
an indepth-related query, besides a baseline ranking, participants
should produce a second ranking, aimed at favouring indepth blogs,
and a third ranking, aimed at favouring shallow blogs.

In TREC 2009, we deploy different machine learning techniques
in order to identify blogs fulfilling a desired facet inclination, from
a baseline ranking produced by the Voting Model. In particular,
we investigate both traditional text classification techniques [36] as
well as a state-of-the-art learning-to-rank technique [39] in order to
produce targeted rankings for each inclination.

Our first approach to this task builds upon traditional text classification.
By using four different classifiers, we estimate the extent
to which a given blog matches the facet inclination of interest. In
particular, we use the following classifiers: Naive Bayes, a decision
tree learner (J48), logistic regression, and a Support Vector
Machine (SVM) classifier [10]. The classifier’s confidence in the
classification of a blog to a particular inclination is then integrated
with the baseline relevance score using FLOE [9]. In our second
approach, we employ the AdaRank [39] learning-to-rank algorithm
to produce a ranking model for each inclination.

To enable both approaches, we devise a set of 18 features, calculated
from individual blog posts as well as entire blogs, for the
facets considered in this task. For example, intuitively, long posts
or sentences should reflect a more indepth blog, whereas having
only a single author or having offensive words should likely constitute
positive indicators of a personal blog. Additionally, for the
opinionated facet, we repurpose our effective post-level opinion detection
techniques [12, 13], deployed in previous Blog tracks [11,
14], in order to produce blog-level opinion features. Despite being
motivated by our intuitions regarding specific facet inclinations, we
do not restrict the use of these features to the identification of blogs
fulfilling these inclinations, but instead let our deployed approaches
decide whether and how to use each feature. Additionally, for our
learning-to-rank approach, negated versions of all features are also
considered, so as to allow the learner to decide whether a highly
weighted feature should be considered a positive or a negative indicator
of a particular facet inclination (for instance, is a long average
sentence length a good or bad feature for an indepth blog?). Finally,
both text classification and learning-to-rank approaches are trained
using a few annotated examples of blogs that fulfil each facet inclination,
gathered from the TREC Blogs06 collection [22].

We submit four runs to the faceted blog distillation task, as described
next and summarised in Table 2. All runs use the DPH
weighting model (Equation (1)) and the expCombMNZ voting technique
(Equation (7)) to create an initial ranking of blogs, which is
then re-ranked to match a particular facet inclination.

1.  uogTrFBNclas uses the confidence scores provided by a naive
Bayes classifier to re-rank blogs for each facet inclination.

2.  uogTrFBMclas is similar to uogTrFBNclas, except that it
uses the scores provided by the best (rather than a single)
of our considered classifiers on a per-facet inclination basis,
according to their performance on the training data.

3.  uogTrFBAlr uses the AdaRank algorithm to learn a different
ranking model for each facet inclination.

4.  uogTrFBHlr is similar to uogTrFBAlr, but uses intuitively
set feature weights for each facet inclination, as a baseline.

Table 3 shows the results of our submitted runs for each of the
facets of interest. Performance is given in terms of mean average
precision (MAP) on a per-facet inclination basis. Additionally, the
performance of our baseline ranking for each inclination is also
shown. Unfortunately, we had an oversight in the configuration
of this baseline, which used an extremely large |R(Q)|, markedly

Column groups: Indepth (18 queries) | Opinionated (13 queries) | Personal (8 queries)
Columns within each group: Base, First, Base, Second
Values are listed per row in their extracted order.

TREC best                  -  0.1906  [illegible]
TREC median                [illegible]  -  0.0250  [illegible]
uogTrFBNclas  submitted    0.1033  0.0652  0.0259  0.1012  0.0988  [illegible]  0.1691
uogTrFBNclas  corrected    0.1671  0.0846  0.0479  0.0321  0.1074  0.1032  0.1266  [illegible]  0.1219  0.1733  0.1710
uogTrFBNclas  overfitted   0.1671  0.1735  0.0479  0.0113  0.1074  0.1274  0.1266  [illegible]  0.1938  0.1813  0.1733
uogTrFBMclas  submitted    0.1033  0.0652  0.0259  0.1012  0.0988  0.0981  0.0691  0.1691  0.1693
uogTrFBMclas  corrected    0.1671  0.1671  0.0479  0.0321  0.1074  0.1032  0.1266  0.0942  0.1938  0.1219  0.1733  [illegible]
uogTrFBMclas  overfitted   0.1671  0.2032  0.0479  0.0113  0.1074  0.1274  0.1266  0.1420  0.1938  0.1813  0.1733
uogTrFBAlr    submitted    0.1033  0.0005  0.0001  0.1012  0.0796  0.1095  0.0981  0.0642  0.1691  0.1743
uogTrFBAlr    corrected    0.1671  0.0059  0.0479  0.0255  0.1074  0.0422  0.1266  0.1254  0.1938  0.0947  0.1733  0.0848
uogTrFBAlr    overfitted   0.1671  0.1589  0.0479  0.0444  0.1074  0.1048  0.1266  0.1271  0.1938  0.1227  0.1733  0.1521
uogTrFBHlr    submitted    0.1033  0.1015  0.0256  0.1012  0.0919  [illegible]  0.0981  0.0739  0.1691
uogTrFBHlr    corrected    0.1671  0.1565  0.0479  0.0468  0.1074  0.0793  [illegible]  0.1938  0.1172  0.1733  0.1805
uogTrFBHlr    overfitted   -  -  -  -  -  -  -  -  -

Table  3:  Per-facet  MAP  performance:  submitted,  corrected,  and  overfitted  runs. 


Run            Description
uogTrFBNclas   DPH+expCombMNZ+Naive
uogTrFBMclas   DPH+expCombMNZ+BestClass
uogTrFBAlr     DPH+expCombMNZ+AdaRank
uogTrFBHlr     DPH+expCombMNZ+Human

Table 2: Submitted runs to the faceted blog distillation task of
the Blog track.


compromising the performance of our submitted runs. Hence, in
Table 3, besides the performance of each run, with |R(Q)| = 20,000,
we include an additional row showing its attained performance after
correcting the baseline ranking, with |R(Q)| = 1,000. Additionally,
in order to assess the impact of the used training data, Table 3 also
includes a row with the performance of our runs when overfitted
using the relevance assessments for this task.³

From Table 3, we first observe that the performance of our submitted
runs is above median across most settings. Moreover, when
the corrected runs are considered, improvements in terms of baseline
performance are observed across all settings. The inclination-specific
performance of these runs, in turn, increases across most
settings, with the second inclination of the personal facet being the
only exception. Nevertheless, even after correcting our baseline,
re-ranking it in order to favour blogs fulfilling a desired facet inclination
remains challenging. We hypothesise that this is partially
due to the insufficient training data we had available. Indeed, when
the overfitted runs are considered, a more comparable performance
to that of our baseline ranking is observed for most settings. As for
the deployed approaches themselves, the classification-based runs
performed generally better for the first inclination of the indepth
and personal facets, as well as for both inclinations of the opinionated
facet, whereas our approach based on learning-to-rank was
generally the best for the remaining settings. Overall, our results attest
to the difficulty of the task [25], but they also show some promising
directions for improvement. In particular, the availability of
suitable training data should allow us to better estimate the usefulness
of different features in discriminating between blogs fulfilling
different facet inclinations.


³ Note that this training regime is not applicable for the uogTrFBHlr
run, as it is independent of the used training data.


5.  BLOG TRACK: TOP STORIES IDENTIFICATION TASK

In  the  top  stories  identification  task,  the  goal  is  to  produce  a  set  of 
important  headlines  (from  an  editorial  perspective)  and  associated 
blog  posts  in  relation  to  a  day  of  interest.  In  particular,  the  task 
involves,  for  each  query  day,  finding  the  most  important  headlines 
for  that  day,  and  then  selecting  ten  relevant  and  diverse  blog  posts 
for  each  of  those  headlines  [25].  We  divide  the  problem  into  two 
distinct  sub-tasks:  headline  ranking,  the  ranking  of  top  headlines 
for  the  query  day;  and  blog  post  selection,  where  we  select  a  diverse 
set  of  top  blog  posts  pertaining  to  a  headline. 

For  our  participation  in  this  task,  we  investigate  the  application 
of  the  Voting  Model  [20]  (see  Section  3.3)  to  the  headline  ranking 
problem.  For  blog  post  selection  for  a  given  headline,  we  explore 
diversity  by  promoting  relevant  yet  novel  blog  posts  in  the  ranking. 
In  particular,  we  explore  both  the  textual  and  temporal  dissimilarity 
between  blog  posts  as  evidence  for  diversification. 

5.1  Headline  Ranking 

The aim of the headline ranking sub-task is to produce a set of
headlines which were deemed, from an editorial perspective, to be
important on the query day d_Q, using evidence from the blogosphere.
Our headline ranking approach is based on the intuition
that, on any day, bloggers will create posts pertaining to prominent
news stories for that day. We desire to score a headline h for a given
query day d_Q, which we denote score(h, d_Q). Our basic approach
uses the Votes voting technique (Equation (6)) to score all headlines
published on day d_Q ± 1 (to account for the time difference
between countries), by counting the number of blog posts mentioning
the headline h on query day d_Q (i.e. from the ranking of blog
posts R(h)). We use DPH (Equation (1)) for ranking blog posts in
response to a headline. As suggested in [20], we limit the number
of retrieved blog posts to |R(h)| ≤ 1000.

However, blog posts created after the query day d_Q may also
help to improve the accuracy of our approach. Our intuition is that
news stories will often be discussed afterwards for long running,
controversial or important unpredictable stories, e.g. the aftermath
of a terrorist bombing. Indeed, by taking this evidence into account,
we can identify those stories which maintain their interest
over time, and as such can be deemed more important. In particular,
[16] suggested that bursts in term distributions could last for a
period of time. Hence, in the following, we define two alternative


Run            Headline Ranking                                   Blog Post Selection
uogTrTsbmmr    DPH + Votes, |R(h)|=1000 (baseline)                DPH + Diversify(MMR)
uogTrTswtime   + NDayBoost(n=7)                                   DPH + Diversify(Time)
uogTrTstimes   + CE(Wikipedia,10 terms)                           MergedSubRankings(DPH) + Diversify(Time)
uogTrTSemmrs   + CE(Wikipedia,10 terms) + GaussBoost(w=4,n=14)    MergedSubRankings(DPH) + Diversify(MMR)

Table 4: Summary of submitted runs to the top stories identification task of the Blog track.


techniques for calculating score(h, d_Q), which leverage the temporal
distribution of each headline h over time. In particular, these
techniques accumulate vote evidence from the days preceding or
following d_Q, to ‘boost’ the score of headlines which retain their
importance over multiple days.

In our first proposed temporal distribution boosting technique,
NDayBoost, we linearly combine the scores for the n
days before or after day d_Q, as:

$$ score_{NDayBoost}(h, d_Q) = \sum_{d=d_Q}^{d_Q+n} |R(h,d)| \qquad (8) $$

where |R(h, d)| measures the importance of headline h on day d,
n is a parameter controlling the number of days before (n < 0) or
after (n > 0) d_Q to take into account, while d represents any single
day. Note that this technique places equal emphasis on all days d;
we expect the distribution of |R(h, d)| to peak around day d_Q.

Importantly, this approach can incorporate evidence from multiple
days. However, due to the linear nature of the score aggregation,
all days are treated equally, when it is intuitive to think that days
more distant from d_Q will provide poorer evidence.

To address this, we propose a second temporal distribution boosting
technique. In particular, GaussBoost is similarly based upon
the intuition that important stories will run for multiple days. However,
instead of judging each subsequent day equally, we weight
based on the time elapsed from the day of interest d_Q, using a
Gaussian curve to define the magnitude of emphasis. In this way,
we state a preference for stories that were most important around
d_Q, rather than stories which peaked some time before/after d_Q:

$$ score_{GaussBoost}(h, d_Q) = \sum_{d=d_Q}^{d_Q+m} Gauss(d - d_Q) \cdot |R(h,d)| \qquad (9) $$

where m is the maximum number of days before or after d_Q to take
into account and d - d_Q is the number of days elapsed since the
day of interest d_Q (0 ≤ d - d_Q ≤ m). Gauss(Δd) is the Gaussian
curve value for a difference of days Δd, as given by:

$$ Gauss(\Delta d) = \frac{1}{w \cdot \sqrt{2\pi}} \cdot \exp\left(-\frac{(\Delta d)^2}{(2w)^2}\right) \qquad (10) $$

where w defines the width of the Gaussian curve. A smaller w
will emphasise stories closer to d_Q, while a larger w will take into
account stories on more distant days, up to the maximum m days.
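The two boosting techniques differ only in how the per-day vote counts are weighted, as the following Python sketch illustrates. Several details are assumptions of the sketch: days are represented as integers, `votes_by_day` is a hypothetical mapping from a day to |R(h, d)| for one headline, and the Gaussian takes (2w)^2 as the denominator of the exponent, following the legible part of Equation (10).

```python
import math


def n_day_boost(votes_by_day, query_day, n):
    """NDayBoost: unweighted sum of votes from the query day to n days
    after it (n > 0) or before it (n < 0)."""
    lo, hi = sorted((query_day, query_day + n))
    return sum(votes_by_day.get(d, 0) for d in range(lo, hi + 1))


def gauss(delta_days, w):
    """Gaussian weight for a day difference, with width parameter w
    (exponent denominator (2w)^2 is an assumption of this sketch)."""
    return math.exp(-(delta_days ** 2) / (2 * w) ** 2) / (w * math.sqrt(2 * math.pi))


def gauss_boost(votes_by_day, query_day, m, w):
    """GaussBoost: votes from the query day up to m days later, each
    weighted by its distance from the query day."""
    return sum(
        gauss(d - query_day, w) * votes_by_day.get(d, 0)
        for d in range(query_day, query_day + m + 1)
    )
```

With w=4, a day four days away from the query day still contributes roughly 78% of the weight of the query day itself, so the curve decays gently; smaller values of w concentrate the evidence around the query day.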

It should also be noted that the original headlines provided for
this task contain many non-news entries (e.g. paid death notices,
corrections, etc). We apply a small set of heuristics to the headline
corpus beforehand to remove these spurious entries, on the intuition
that these headlines can never be deemed important. Furthermore,
as a means to counter term sparsity in the headlines, we investigate
the usefulness of collection enrichment [18, 21, 31] in this domain.
Indeed, expanding queries based on a higher quality, external resource
has been shown to be more effective than doing so on the
local collection, since blog posts are often noisy [14]. In particular,
we enrich each headline from Wikipedia (as extracted from the
ClueWeb09 collection) using DPH (Equation (1)) for retrieval and


Run 

Submitted 

Corrected 

TREC  best 

0.2600 

TREC  median 

0.0400 

uogTrTsbmmr 

0.1731 

N/A 

uogTrTswtime 

0.0795 

0.1812 

uogTrTstimes 

0.1862 

N/A 

uogTrTSemmrs 

0.1186 

0.1720 

Table  5:  Headline  ranking  MAP  performance  of  our  submit¬ 
ted  and  corrected  (where  applicable)  runs  for  the  top  stories 
identification  task  of  the  Blog  track. 


Bol  (Equation  (5))  to  select  the  top  10  terms  for  each  headline.  Our 
submitted  runs  are  summarised  in  Table  4. 

Table 5 presents the mean average precision for headline ranking over our four submitted runs. From the results, we see that our baseline (uogTrTsbmmr) voting-based approach provides a strong performance of 0.1731 MAP, which is markedly higher than the median for this task. Indeed, all of our submitted runs comfortably exceed this median. Note that, for our boosting runs (uogTrTswtime and uogTrTSemmrs), we encountered a ‘long’ to ‘int’ overflow bug, which affected their performance. Once this was corrected, their performances were comparable to our baseline, as shown in the corrected column of Table 5. Indeed, uogTrTswtime improved upon our baseline ranking, indicating that there is useful evidence from after the query day which can be leveraged to improve the ranking performance. Our best run was that done with collection enrichment using Wikipedia, which indicates that, indeed, term sparsity within headlines is an important factor, and deserves further investigation. Moreover, uogTrTswtime proved to be the best run at the TREC 2009 Blog top stories task [25].

5.2  Blog  Post  Selection 

The goal of the blog selection sub-task is to retrieve a set of ten blog posts for a given headline which are both relevant to this headline, and moreover cover as large a variety of the aspects of this headline as possible. Using DPH, we produce a first ranking of blog posts for each headline. However, there is also additional temporal information which can be exploited to improve upon this initial ranking. During the headline selection sub-task, our approach generates day-oriented blog post rankings for each headline, i.e. for day dq, the top blog posts (if any) which match each retrieved headline h. We exploit this to create a second, enhanced blog post ranking, by merging some of these day-oriented blog post rankings together, keeping only the top scored results. In particular, we merge the rankings for the day of the headline with those for the following week. In this way, we restrict the blog posts to be selected to only those in temporal proximity to the query day dq, on the intuition that these will more likely be relevant, while still bringing potentially novel information as the story develops.
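The merging of day-oriented rankings can be illustrated with a short sketch. It assumes that each day's ranking is a list of (post identifier, score) pairs, that scores are comparable across days, and that a post appearing on several days keeps its highest score; the function name merge_day_rankings and its parameters are hypothetical.

```python
def merge_day_rankings(rankings_by_day, d_q, days_after=7, top_k=10):
    """Merge the rankings for day d_q and the following days_after days.

    rankings_by_day maps a day index to a list of (post_id, score) pairs.
    Each post keeps its best score; only the top_k posts are returned.
    """
    best = {}
    for day in range(d_q, d_q + days_after + 1):
        for post_id, score in rankings_by_day.get(day, []):
            if score > best.get(post_id, float("-inf")):
                best[post_id] = score
    # Sort by descending score, breaking ties by identifier for determinism.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


example = {
    0: [("a", 3.0), ("b", 1.0)],
    2: [("c", 2.5), ("a", 1.0)],
    9: [("z", 99.0)],  # outside the one-week window, hence ignored
}
merged = merge_day_rankings(example, d_q=0, days_after=7, top_k=2)
```

In this example, the post published nine days after dq is excluded regardless of its score, which reflects the restriction to temporal proximity described above.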

Run            α-NDCG@10   IA-P@10
TREC best      0.7723      0.2758
TREC median    0.0217      0.0040
uogTrTsbmmr    0.518       0.168
uogTrTswtime   0.297       0.094
uogTrTstimes   0.449       0.155
uogTrTSemmrs   0.371       0.123

Table 6: Blog post selection performance of our submitted runs for the top stories identification task of the Blog track.

To diversify either of these two blog post rankings, we then apply one of two re-ranking techniques: diversification through textual dissimilarity and diversification using temporal dissimilarity. For textual dissimilarity, we apply the Maximal Marginal Relevance (MMR) [7] method. In particular, MMR greedily selects a document d* from the initial ranking with maximum relevance to the query (headline) and maximum dissimilarity to the previously selected documents (blog posts). The selection criterion used by the MMR algorithm is defined below:

$$d^* = \arg\max_{d_i \in R \setminus S} \left[ \lambda \, \mathrm{Sim}_1(d_i, h) - (1 - \lambda) \max_{d_j \in S} \mathrm{Sim}_2(d_i, d_j) \right] \qquad (11)$$


where R is a ranked list of blog posts, h is a headline, S is the subset of documents in R already selected, and R \ S is the set difference, i.e. the documents not yet selected. Sim1 is the similarity metric used in document retrieval (i.e. DPH), and Sim2 is the similarity between documents di and dj, which can be computed by the same metric used for Sim1 or a different one. In particular, we use the cosine distance between vector representations of the blog posts di and dj, weighted by DPH.
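The greedy MMR selection can be sketched as below. This is a simplified illustration rather than our actual implementation: it assumes relevance scores already normalised to [0, 1], sparse term-weight dictionaries as document vectors, and the cosine of the angle between vectors as Sim2. The names cosine, mmr, relevance and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr(relevance, vectors, lam=0.5, k=10):
    """Greedily select up to k documents using the MMR criterion.

    relevance maps doc_id -> Sim1(d, h); vectors maps doc_id -> term weights.
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def criterion(doc_id):
            # Redundancy is the highest similarity to any selected document.
            redundancy = max(
                (cosine(vectors[doc_id], vectors[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[doc_id] - (1.0 - lam) * redundancy
        best = max(sorted(candidates), key=criterion)
        selected.append(best)
        candidates.remove(best)
    return selected


relevance = {"p1": 1.0, "p2": 0.9, "p3": 0.6}
vectors = {"p1": {"a": 1.0}, "p2": {"a": 1.0}, "p3": {"b": 1.0}}
diversified = mmr(relevance, vectors, lam=0.5)
relevance_only = mmr(relevance, vectors, lam=1.0)
```

Here p2 duplicates p1, so with λ = 0.5 the less relevant but dissimilar p3 is promoted above it, whereas λ = 1 reduces MMR to the original relevance ordering.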

For temporal dissimilarity, we develop a novel time-based diversification approach, which exploits the evolution of a story over time. The intuition is that, as the story progresses, different viewpoints will be expressed and new actors will arrive. Hence, to truly provide an overview of a particular story, we hypothesise that blog posts should be selected over time. To promote a wide variety of blog posts over the course of the story, we select blog posts with increasing temporal distance from the headline time. In particular, we incrementally select blog posts published at least 6 hours apart. Our submitted runs are listed in the last column of Table 4.
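One possible reading of this time-based selection is sketched below. It assumes that the input ranking is traversed in order of decreasing relevance, and that a post is kept only if it was published at least 6 hours away from every post already kept. Both the traversal order and the function name select_spaced_posts are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def select_spaced_posts(ranked_posts, min_gap=timedelta(hours=6), k=10):
    """Select up to k posts whose publication times are min_gap apart.

    ranked_posts is a list of (post_id, published) pairs in ranking order.
    """
    selected = []
    for post_id, published in ranked_posts:
        if all(abs(published - kept_time) >= min_gap for _, kept_time in selected):
            selected.append((post_id, published))
            if len(selected) == k:
                break
    return [post_id for post_id, _ in selected]


t0 = datetime(2008, 5, 1, 8, 0)
ranking = [
    ("a", t0),
    ("b", t0 + timedelta(hours=2)),   # too close to "a"
    ("c", t0 + timedelta(hours=6)),
    ("d", t0 + timedelta(hours=11)),  # too close to "c"
    ("e", t0 + timedelta(hours=13)),
]
spaced = select_spaced_posts(ranking)
```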

Our results are shown in Table 6. We can see that all our results outperform the TREC median by a large margin, with our best run (uogTrTsbmmr) achieving 0.518 α-NDCG@10. Indeed, it was the best top news stories identification task run at TREC 2009 [25]. Moreover, both maximal marginal relevance (uogTrTsbmmr) and temporal diversification (uogTrTstimes) proved to be effective techniques when applied on our baseline DPH blog post ranking. In contrast, runs using our merged blog post rankings (uogTrTswtime and uogTrTSemmrs) were less effective. However, it is unclear whether their performance is due to the method itself, or to the input data from the headline ranking, which was the subject of the overflow bug mentioned earlier. In point of fact, further investigation confirmed that indeed the input data was to blame.

6.  ENTITY  TRACK 

In the new Entity track, the goal is to retrieve entities of a particular type (people, organisations, or products) that are somehow related to an input entity in the query [4]. Our major goal in this track was to extend our Voting Model to the task of finding related entities of the desired type. Our approach builds a semantic relationship support for the Voting Model, by considering the co-occurrences of query terms and entities within a document as a vote for the relationship between these entities and the one in the query. Additionally, on top of the Voting Model, we develop novel techniques to further enhance the initial vote estimations. In particular, we promote entities associated to authoritative documents or documents from the same community as the query entity in the hyperlink structure underlying the ClueWeb09 collection.

Firstly, in order to identify entities in the category B subset of the corpus, we resort to an efficient dictionary-based named entity recognition approach.⁴ In particular, we build a large dictionary of entity names using DBPedia,⁵ a structured representation of Wikipedia. Dictionary entries comprise all known aliases for each unique entity, as obtained from DBPedia (e.g., ‘Barack Obama’ is represented by the dictionary entries ‘Barack Obama’ and ‘44th President of the United States’). In order to differentiate between the entity types of interest in this task, DBPedia names are further categorised as people, organisations, or products, based on each entity’s category description on DBPedia and several heuristics (for instance, the occurrence of the clue word ‘company’ is likely to identify organisations). In order to account for people that do not have a Wikipedia page, entries in the produced dictionary are complemented with common proper names derived from the US Census data.⁶ After being identified, entity name occurrences in the corpus are recorded in appropriate index structures, so as to make this information efficiently available at querying time. By doing so, a rich profile is built for every unique entity, comprising the documents in which the entity occurs in the corpus.
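The dictionary lookup and profile construction can be illustrated with a toy sketch. It is not the recogniser used in our experiments: it assumes whitespace tokenisation, case-insensitive longest-match lookup, and hypothetical entity identifiers such as "Barack_Obama". The function names are likewise hypothetical.

```python
def build_dictionary(aliases_by_entity):
    """Map each lower-cased alias (as a token tuple) to its entity."""
    lookup = {}
    for entity, aliases in aliases_by_entity.items():
        for alias in aliases:
            lookup[tuple(alias.lower().split())] = entity
    return lookup


def find_entities(text, lookup):
    """Return entities mentioned in text, preferring the longest alias."""
    tokens = text.lower().split()
    max_len = max((len(alias) for alias in lookup), default=0)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            entity = lookup.get(tuple(tokens[i:i + n]))
            if entity is not None:
                found.append(entity)
                i += n
                break
        else:
            i += 1  # no alias starts at this token
    return found


def build_profiles(documents, lookup):
    """Build entity -> set of document identifiers in which it occurs."""
    profiles = {}
    for doc_id, text in documents.items():
        for entity in find_entities(text, lookup):
            profiles.setdefault(entity, set()).add(doc_id)
    return profiles


lookup = build_dictionary(
    {"Barack_Obama": ["Barack Obama", "44th President of the United States"]}
)
docs = {
    "doc1": "The 44th President of the United States met reporters",
    "doc2": "Barack Obama spoke in Chicago",
    "doc3": "No entity is mentioned here",
}
profiles = build_profiles(docs, lookup)
```

Both aliases resolve to the same entity, so its profile contains doc1 and doc2 but not doc3.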

Additionally, in order to find the correct homepages for each retrieved entity, we again resort to DBPedia. In particular, for some catalogued entities, DBPedia includes a set of associated documents, which correspond to external (i.e., non-Wikipedia) pages linked to from each entity’s Wikipedia page, and are more likely to correspond to the desired homepages for that entity. For entities with no such associated documents and also for non-DBPedia entities, we simply retrieve the top scored documents from the entities’ profile as their candidate homepages.

At querying time, we experiment with different approaches that refine the initial ranking of documents for a given query. Firstly, on top of the DPH weighting model (Equation (1)), we apply the pBiL proximity model (Equation (4)), in order to favour documents in which the query terms occur in close proximity. This can be particularly beneficial, as the queries in this task include named entities. Additionally, in an attempt to promote authoritative homepages at the document ranking level, we integrate a document indegree feature, computed on the hyperlink graph underlying the category B subset of the ClueWeb09 collection. Alternatively, we experiment with a state-of-the-art community detection technique [5], in order to favour documents from the same community as those associated to the input entity. By doing so, we expect to promote the entities associated to these documents, with the intuition that these entities are more likely to be related to the input entity. On top of the document ranking produced by either of these techniques, the expCombMNZ voting technique (Equation (7)) is then applied to produce a ranking of entities, generating the following runs:

1.  uogTrEbl is a baseline run, which applies the DPH weighting model and the expCombMNZ voting technique.

2.  uogTrEpr applies the pBiL proximity model at the document ranking level, with a window size ws = 4.

3.  uogTrEdi integrates the indegree feature to the document ranking using FLOE, with the settings suggested in [9].

4.  uogTrEc3 promotes entities associated to documents from the same community as those associated to the input entity.
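The voting step shared by these runs can be sketched as follows. The sketch assumes that expCombMNZ scores an entity as the number of its profile documents appearing in the document ranking, multiplied by the sum of the exponentials of their scores; the function name exp_comb_mnz and the data layout are hypothetical.

```python
import math


def exp_comb_mnz(doc_scores, profiles):
    """Rank entities by votes from the retrieved documents in their profiles.

    doc_scores maps doc_id -> retrieval score for the query;
    profiles maps entity -> set of doc_ids in which the entity occurs.
    """
    ranking = []
    for entity, docs in profiles.items():
        votes = [doc_scores[d] for d in docs if d in doc_scores]
        if votes:  # entities with no retrieved document receive no score
            score = len(votes) * sum(math.exp(s) for s in votes)
            ranking.append((entity, score))
    ranking.sort(key=lambda item: (-item[1], item[0]))
    return ranking


doc_scores = {"d1": 2.0, "d2": 1.0, "d3": 0.5}
entity_profiles = {"e1": {"d1", "d2"}, "e2": {"d1"}, "e3": {"d9"}}
entity_ranking = exp_comb_mnz(doc_scores, entity_profiles)
```

In this example, e1 outranks e2 because it receives two votes, while e3 is not ranked since none of its profile documents was retrieved.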

⁴ http://alias-i.com/lingpipe
⁵ http://dbpedia.org
⁶ http://www.census.gov/genealogy/names/names_files.html


Table 7 summarises our submitted runs, while Table 8 presents their results in terms of normalised discounted cumulative gain (NDCG) at R, where R is the number of primary and relevant documents (i.e., homepages), precision at 10 (P@10), and the total number of relevant (rel) and primary (pri) homepages retrieved.


Run        Description
uogTrEbl   DPH+expCombMNZ
uogTrEpr   DPH+pBiL+expCombMNZ
uogTrEdi   DPH+indegree+expCombMNZ
uogTrEc3   DPH+communities+expCombMNZ

Table 7: Submitted runs to the Entity track.


From the results in Table 8, we observe that all our runs perform well above the median of the participant groups according to both NDCG@R and P@10. Moreover, all four runs achieved by far the best performance among the TREC participants in terms of the number of relevant and primary homepages associated with the retrieved entities [4], hence attesting the strength of our baseline approach. Additionally, the integration of the document indegree feature further improved over our strongly performing baseline in terms of P@10. Moreover, applying proximity at the document ranking level brought improvements in terms of both NDCG@R and P@10, as did our community-based boosting technique. Overall, these results not only attest the effectiveness of our Voting Model extension for this task, but also demonstrate its promise as a general framework for entity-related search tasks.


Run           NDCG@R   P@10     #rel   #pri
TREC best     0.4098   0.3500   ?      ?
TREC median   0.0751   ?        ?      ?
uogTrEbl      ?        ?        344    ?
uogTrEpr      0.2662   0.1200   347    ?
uogTrEdi      0.2502   0.1150   343    74
uogTrEc3      0.2604   0.1200   331    75

Table 8: Results of submitted runs to the Entity track.


7.  WEB  TRACK:  ADHOC  TASK 

In  the  adhoc  task  of  the  Web  track,  participants  aimed  to  identify 
topically  relevant  documents  on  both  the  category  B  (50  million 
documents)  and  category  A  (500  million  documents)  subsets  of  the 
ClueWeb09  corpus  [8].  In  this  task,  we  aimed  to  test  our  DFR 
models,  and  the  Terrier  IR  platform  on  this  larger  corpus. 

In particular, we submitted three runs to the adhoc task. Two of these were for category B, one for category A. For all runs, we applied the DPH DFR model (Equation (1)). In particular, the submitted runs, and an unsubmitted baseline are described below:

•  uogTrdph is our unsubmitted baseline, and uses DPH only.

•  uogTrdphP adds the pBiL proximity model (Equation (4)), with window size ws = 4, to the scores generated by DPH.

•  uogTrdphA tests the simple use of anchor text, by uniformly combining scores from content and anchor text indices.

•  uogTrdphCEwP uses collection enrichment (CE) [18, 21, 31], by expanding the queries from documents retrieved only from the Wikipedia portion of ClueWeb09. The Bo1 term weighting model (Equation (5)) is used to weight terms in the pseudo-feedback documents. Additionally, the pBiL proximity model is also applied by this run, with ws = 4.


A summary of our submitted runs is given in Table 9. Their retrieval performance is provided in Table 10. For the category B runs, we note the following: our DPH weighting model baseline run (uogTrdph) performed well above median; applying collection enrichment and proximity (uogTrdphCEwP) improved upon the baseline; however, the simplistic combination of anchor text with content used by run uogTrdphA was detrimental to retrieval performance. For the category A runs, our retrieval performance was roughly median. On a closer inspection of our category A run, we found that it suffered from retrieving many spam Web pages. Hence, in the future, we will investigate the application of techniques to remove spam, and/or identify high quality documents.


Run            Submitted   Cat.   Description
uogTrdph       ✗           A/B    DPH(content)
uogTrdphP      ✓           A      +pBiL
uogTrdphA      ✓           B      +DPH(anchor text)
uogTrdphCEwP   ✓           B      +CE+pBiL

Table 9: Submitted and unsubmitted runs to the adhoc task of the Web track, including the category of each run.


B runs         Submitted   statMAP   statNDCG
TREC best                  0.4305    0.6091
TREC median                0.1539    0.2956
uogTrdph       ✗           0.1970    0.3096
uogTrdphA      ✓           0.1825    0.3245
uogTrdphCEwP   ✓           0.2072    0.3934

A runs         Submitted   P@5       P@10
TREC best                  ?         ?
TREC median                ?         ?
uogTrdph       ✗           0.0650    0.0969
uogTrdphP      ✓           0.1600    0.1660

Table 10: Results of submitted and unsubmitted runs to the adhoc task of the Web track.

Overall, our results in this task show the promise of the Terrier IR platform and the DFR weighting models for larger corpora, even without any training, since our participation in TREC 2009 relied solely on parameter-free models. Additionally, we believe we can enhance our retrieval performance by applying field-based weighting models (e.g. PL2F [19]), and, particularly on the A subset, developing spam detection techniques.

8.  WEB  TRACK:  DIVERSITY  TASK 

The  goal  of  the  diversity  task  of  the  Web  track  is  to  produce 
a  ranking  of  documents  that  (1)  maximises  the  coverage  and  (2) 
reduces  the  redundancy  of  the  retrieved  documents  with  respect  to 
the  possible  aspects  underlying  a  query,  in  the  hope  that  users  will 
find  at  least  one  of  these  documents  to  be  relevant  to  this  query  [8]. 

In our participation in this task, we propose to explicitly take into account the possible aspects underlying a query, in the form of sub-queries [34, 35]. By estimating the relevance of the retrieved documents to individual sub-queries, we seek to produce a re-ranking of these documents that maximises the coverage of the aspects underlying the initial query, while reducing its redundancy with respect to already well covered aspects. In particular, we experiment with our novel framework for search result diversification, called xQuAD (eXplicit Query Aspect Diversification). Given a query Q, and an input ranking R(Q), xQuAD iteratively builds a result ranking S(Q) by selecting, at each iteration, the document d* ∈ R(Q) \ S(Q) with the highest score, as given by:


$$d^* = \arg\max_{d \in R(Q) \setminus S(Q)} \; r_1(d, Q) \sum_{Q' \in G(Q)} \frac{i(Q', Q) \, r_2(d, Q')}{m(Q', S(Q))} \qquad (12)$$


where:

•  r1(d, Q) is the relevance of document d with respect to the initial query Q, as estimated by any retrieval approach, such as the DPH document weighting model (Equation (1)),

•  G(Q) is the set of sub-queries Q' associated to Q,

•  i(Q', Q) is the estimated importance of the sub-query Q' relatively to all sub-queries associated to Q,

•  r2(d, Q') is the relevance of document d to the sub-query Q', as estimated by any retrieval approach (not necessarily the same used for r1(d, Q)), and

•  m(Q', S(Q)) estimates the amount of information satisfying the sub-query Q' present in the documents already selected in S(Q), as a measure of novelty.

In our experiments, the G(Q) component is based on query suggestions provided by a major Web search engine for each of the TREC 2009 Web track topics. Alternatively, we investigate a new cluster-based query expansion technique aimed at generating sub-queries from the target collection itself. In particular, we cluster the top retrieved results for an initial query using the k-means algorithm [26], and then generate different sub-queries by expanding the initial query from each individual cluster.

As for the importance component, i(Q', Q), we propose a simple baseline estimation mechanism, iu(Q', Q), which considers a uniform importance distribution over sub-queries:

$$i_u(Q', Q) = \frac{1}{|G(Q)|} \qquad (13)$$

where |G(Q)| is the number of sub-queries generated for query Q. Alternatively, we experiment with biasing the diversification process towards those sub-queries likely to represent more plausible aspects of the initial query. Inspired by a state-of-the-art resource selection technique [37], we estimate the relative importance of each generated sub-query, by considering the ranking produced for this sub-query as a sample of the documents it covers in the whole collection. In particular, we estimate the importance ic(Q', Q) of the sub-query Q' as:

$$i_c(Q', Q) = \frac{n(Q')}{\max_{Q'' \in G(Q)} n(Q'')} \cdot \frac{1}{\hat{n}(Q')} \sum_{d \,|\, r_2(d, Q') > 0} \left( \tau - j(d, Q) \right) \qquad (14)$$

where r2(d, Q') is as described above, n(Q') is the total number of results associated with the sub-query Q', n̂(Q') corresponds to the number of results associated to Q' that are among the top τ ranked results for the initial query Q, with j(d, Q) giving the ranking position of the document d with respect to Q.

Finally, the novelty component m(Q', S(Q)) is estimated as the number of documents retrieved for the sub-query Q' that are among the already selected documents in S(Q).

Run            Cat.   r1,2(d, Q)   G(Q)   i(Q', Q)
uogTrDYScdA    A      DPH          sWQ    iu
uogTrDPCQcdB   B      DPH+pBiL     cQE    iu
uogTrDYCcsB    B      DPH+pBiL     sWQ    ic

Table 11: Submitted runs to the diversity task of the Web track, including the category of each run.

In our submitted runs, we use the DPH weighting model to produce the initial baseline ranking, and also the ranking for each identified sub-query, for category A. For category B, DPH is used along with the pBiL proximity model, with a window size ws = 4. On top of the initial baseline, we experiment with the different components of our proposed framework to produce diverse rankings with τ = 1000 documents, resulting in the following three submitted runs, summarised in Table 11:

1.  uogTrDYScdA retrieves documents from the whole of ClueWeb09, and then re-ranks these using query suggestions from a major Web search engine as sub-queries (denoted sWQ), weighted by the iu importance estimator.

2.  uogTrDPCQcdB investigates generating sub-queries from the collection itself, by applying the previously described cluster-based query expansion technique (denoted cQE). The Bo1 term weighting model (Equation (5)) is used to produce sub-queries from each cluster generated by k-means (k = 10) from a baseline ranking of 1000 documents. In our experiments, a maximum of 10 terms are expanded from the 3 highly scored documents in each cluster, so as to form a sub-query. The iu importance estimator is used once again.

3.  uogTrDYCcsB uses the same sub-queries as uogTrDYScdA (i.e., sWQ), but with the importance of each sub-query estimated by our resource selection-inspired technique, ic.
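The greedy re-ranking underlying these runs can be illustrated with the following sketch. It assumes the selection score is the product of r1(d, Q) and the sum, over sub-queries, of i(Q', Q) · r2(d, Q') divided by the novelty estimate, with a uniform importance of 1/|G(Q)|. To avoid dividing by zero before any document has been selected, the sketch divides by 1 + m(Q', S(Q)), which is an assumption of this illustration. All function and variable names are hypothetical.

```python
def uniform_importance(sub_queries):
    """Uniform importance over sub-queries: 1 / |G(Q)| for each sub-query."""
    return {q: 1.0 / len(sub_queries) for q in sub_queries}


def xquad(r1, r2, importance, k=10):
    """Greedy xQuAD-style selection of up to k documents.

    r1 maps doc_id -> relevance to the initial query;
    r2 maps sub_query -> {doc_id: relevance to that sub-query};
    importance maps sub_query -> weight.
    """
    selected = []
    remaining = set(r1)
    while remaining and len(selected) < k:
        def score(doc_id):
            total = 0.0
            for sub_query, docs in r2.items():
                # m: already selected documents retrieved for this sub-query.
                covered = sum(1 for s in selected if docs.get(s, 0.0) > 0.0)
                total += importance[sub_query] * docs.get(doc_id, 0.0) / (1 + covered)
            return r1[doc_id] * total
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


r1 = {"d1": 1.0, "d2": 0.9, "d3": 0.8}
r2 = {"q_a": {"d1": 1.0, "d2": 1.0}, "q_b": {"d3": 1.0}}
diverse_ranking = xquad(r1, r2, uniform_importance(r2))
```

In this example, once d1 has covered sub-query q_a, the document d3 covering the still uncovered q_b is selected ahead of the more relevant but redundant d2.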

Table 12 shows the performance of our runs in the diversity task, in terms of α-normalised discounted cumulative gain (α-NDCG) and intent-aware precision (IA-P). From the table, we observe that all our runs perform well above the median of the TREC participants, for both category A and B settings, and in terms of both measures. Indeed, run uogTrDYCcsB was the best performing among all participating runs for category B, in terms of both α-NDCG and IA-P [8]. Notwithstanding, there is still scope for improvements, as demonstrated by a further analysis of the individual components underlying our framework, their own performance, and their contribution to the performance of the approach as a whole [35].


Run            Cat.   α-NDCG@10   IA-P@10
TREC best      A      0.5144      0.2105
TREC median    A      0.1324      0.0541
uogTrDYScdA    A      0.1910      0.0770
TREC best      B      0.5267      0.2447
TREC median    B      0.1758      0.0733
uogTrDPCQcdB   B      0.2710      0.1340
uogTrDYCcsB    B      0.2820      0.1320

Table 12: Retrieval performances of our submitted runs to the diversity task of the Web track.

9.  MILLION  QUERY  TRACK 

In the Million Query track, participants aimed to identify topically relevant documents for many queries, on both the category B (50 million documents) and category A (500 million documents) subsets of the ClueWeb09 corpus. We submitted two runs to the Million Query track, to see how the performance of these runs differed from the performance of the equivalent runs submitted to the adhoc task of the Web track. The runs were:

•  uogTRMQdph40 applies DPH on the content-only index (40,000 queries). This run corresponds to run uogTrdph from the adhoc task of the Web track (see Section 7).

•  uogTRMQdpA10 applies DPH, combining scores from body and anchor text indices (10,000 queries). This run corresponds to run uogTrdphA from the adhoc task of the Web track (see Section 7).

Table 13 summarises the obtained retrieval performance of the runs submitted to the Million Query track. We note that the retrieval performance of both runs is markedly above the median, particularly on the statMAP measure. Anchor text brings no marked benefit to eMAP performance, but is detrimental to statMAP performance, mirroring our observations from Section 7.


Run            statMAP   eMAP
TREC best      0.5378    0.1592
TREC median    0.1535    0.0591
uogTRMQdpA10   0.2612    0.0869
uogTRMQdph40   0.2339    0.0881

Table 13: Summary of retrieval performance of our submitted Million Query track runs.


10.  RELEVANCE  FEEDBACK  TRACK 

The aim of the TREC 2009 Relevance Feedback track was to examine the aspects affecting the selection of good feedback documents. In our participation in this track, we focus on the stability of our query expansion Bo1 DFR term weighting model (Equation (5)) across different feedback identification strategies.

In the first phase of the track, participants submitted 5 ClueWeb09 category B documents to be assessed for each topic. Our first feedback set, ugTr.1, was created using the DLH13 model [21]. Our second feedback set, ugTr.2, was created using the DPH model (and hence corresponds to run uogTrdph from Section 7). Table 14 reports the P@5 performances of our submitted feedback sets, and also of several other phase 2 feedback sets. From the table, we note that ugTr.1 compared well with the other feedback sets, while ugTr.2 was the best performing of this selection of feedback sets.


Feedback Set   P@5
ugTr.1         0.320
ugTr.2         0.504
CMU.1          0.340
UMas.1         0.496
UPD.1          0.460
YUIR           0.252
hit2.1         0.320
ilps.2         0.368

Table 14: Retrieval performances of our submitted phase 1 feedback sets, and our phase 2 allocated feedback sets for the Relevance Feedback track.

In the second phase, participants submitted one run for each of the 8 feedback sets assigned to them. We used only the relevant documents from each feedback set, and ranked documents from the category B collection using the DPH document weighting model (i.e. based on the Web track adhoc task baseline uogTrdph). We used the Bo1 term weighting model (Equation (5)) to identify and weight the 10 most informative terms to expand the query with.

Unfortunately, our second phase relevance feedback runs encountered a bug in our mapping from ‘DOCNO’ to internal docid, and as a consequence, the feedback documents actually used were incorrect. Hence, in the following, we report the performance of our submitted runs, and the correct retrieval performances.

                 Submitted           Corrected
Run              statMAP   eMAP      statMAP
ugTr.CMU.1       0.1764    0.0409    0.2492
ugTr.UMas.1      0.1958    0.0421    0.2373
ugTr.UPD.1       0.1715    0.0399    0.2325
ugTr.YUIR        0.1900    0.0460    0.2055
ugTr.hit2.1      0.1667    0.0414    0.2578
ugTr.ilps.2      0.1756    0.0407    0.2212
ugTr.ugTr.1      0.2081    0.0464    0.2240
ugTr.ugTr.2      0.1810    0.0409    0.2349

Table 15: Retrieval performances of our allocated feedback sets for the Relevance Feedback track.

Table 15 reports the performance of these submitted and corrected runs, for the various feedback sets. In all cases, retrieval performance was detrimentally impacted by the presence of the bug, compared to the corrected runs. Moreover, we note that Bo1 does not appear to favour the feedback sets identified by the DFR weighting models (i.e. ugTr.1 & ugTr.2). Indeed, some of the less accurate feedback sets (see Table 14) perform better overall. This suggests that some of these sets produce better feedback documents for Bo1 than DPH or DLH13 alone [15]. Indeed, Bo1 performed best using the hit2.1 feedback set, even though hit2.1 performed worse than ugTr.2 (hit2.1 exhibited 37% less P@5 than ugTr.2).
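For reference, the 37% figure follows from the P@5 values reported in Table 14 (0.320 for hit2.1 and 0.504 for ugTr.2), taking the relative difference with respect to ugTr.2:

```latex
\[
\frac{0.504 - 0.320}{0.504} = \frac{0.184}{0.504} \approx 0.365 \approx 37\%
\]
```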

Overall, we conclude that our participation in the Relevance Feedback track has facilitated an investigation into the aspects affecting the Bo1 term weighting model, and testified to its suitability for application using various methods for generating feedback sets.

11.  CONCLUSIONS 

In TREC 2009, we participated in five tracks, namely the Blog, Entity, Million Query, Relevance Feedback and Web tracks, using our Terrier IR platform. In particular, our participation focused on new applications for the Voting Model, as well as on fresh approaches for search result diversification.

Our results for the Blog track top news stories identification task are particularly strong. In the faceted blog distillation task, a configuration oversight hindered the retrieval performance of our runs. Nevertheless, our corrected results show very good promise, but also attest that this task remains hard without suitable training data. In the Entity track, our proposed extension to the Voting Model has been shown to provide a very effective framework for tackling the related entity finding task. In the diversity task of the Web track, the new xQuAD framework shows a strong retrieval performance, with promising directions for further improvements.

Finally, with the advent of several larger test collections, we took the opportunity presented by TREC 2009 to overhaul the Terrier platform, for instance with improved MapReduce indexing, and scalable retrieval. However, a small bug in the improved system affected our submitted Relevance Feedback track runs. On the larger ClueWeb09 category A collection, our runs were affected by the presence of spam in the collection. In the future, we will endeavour to develop spam detection and document quality features.
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